0.0.1 Read in Data

data<-read.csv("data/yelp_data_reformat.csv")

0.1 Filter to look at unique businesses

Filter the data down so we have one row per business and don’t risk double counting or skewing results while analyzing the effect of business characteristics on their average rating.

business_data <- data %>%
    select(Business...Id,Business...Stars,Business...Review.Count,Business...Wi.Fi,Business...Waiter.Service,Business...Take.out,Business...Price.Range,Business...Parking,Business...Noise.Level,Business...Good.For.Kids,Business...Accepts.Credit.Cards,Business...Ages.Allowed,Business...Has.TV,Business...Categories) %>%
    mutate(Price.Range = factor(Business...Price.Range),
           Business...Review.Count = as.numeric(Business...Review.Count)) %>%
    distinct()
head(business_data)
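Because `distinct()` only drops rows that match on every selected column, a business whose attributes differ between rows would still appear more than once. A minimal base-R sketch of a check for that, using a hypothetical toy data frame (`toy` and its values are made up for illustration; only the column names mirror the real data):

```r
# Hypothetical toy data: business "b" appears twice with conflicting review counts.
toy <- data.frame(
  Business...Id = c("a", "b", "b", "c"),
  Business...Review.Count = c(10, 25, 27, 4),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops fully identical rows,
# which is the same idea as distinct() with no arguments.
toy <- unique(toy)

# Count rows per business id and collect any id that still occurs more than once.
id_counts <- table(toy$Business...Id)
repeated_ids <- names(id_counts[id_counts > 1])

print(repeated_ids)
```

If the same check on `business_data$Business...Id` returned any ids, the "one row per business" assumption would not hold and the duplicates would need to be resolved before modelling.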

0.1.1 Clean Data

Perform some basic data cleaning and validations

constant_cols <- whichAreConstant(business_data)
[1] "whichAreConstant: it took me 0.04s to identify 0 constant column(s)"
double_cols <- whichAreInDouble(business_data)
[1] "whichAreInDouble: it took me 0.14s to identify 0 column(s) to drop."
bijections_cols <- whichAreBijection(business_data)
[1] "whichAreBijection: Price.Range is a bijection of Business...Price.Range. I put it in drop list."

[1] "whichAreBijection: it took me 0.39s to identify 1 column(s) to drop."

0.1.2 Key metrics

Perhaps more interesting than the reviews themselves, to both individual businesses and Yelp as a platform, is how reviews drive traffic and patronage. While the connection between high ratings and patronage/interaction may seem obvious, it would be ideal to find a metric that is a closer proxy for actual interaction. With that in mind, I’d like to look at the number of reviews as a dependent variable, since it is perhaps a closer proxy for how many people are actually visiting the establishment.

g<- ggplot(business_data,aes(x=Business...Review.Count)) +
  geom_density(alpha=.3,fill="#D32323",color="#D32323")
ggplotly(g)
g2<-ggplot(business_data,aes(x=Business...Stars)) +
  geom_histogram(alpha=.3,fill="#D32323",color="#D32323")
ggplotly(g2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.2 Looking at Number of Reviews

business_reg <- lm(Business...Review.Count~Price.Range + Business...Stars + Business...Wi.Fi+Business...Noise.Level + Price.Range*Business...Wi.Fi+ Business...Good.For.Kids+Business...Has.TV,data=business_data)
summary(business_reg)

Call:
lm(formula = Business...Review.Count ~ Price.Range + Business...Stars + 
    Business...Wi.Fi + Business...Noise.Level + Price.Range * 
    Business...Wi.Fi + Business...Good.For.Kids + Business...Has.TV, 
    data = business_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-261.36  -71.95  -18.84   38.52 1010.95 

Coefficients: (4 not defined because of singularities)
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       -258.5275    51.2430  -5.045 5.99e-07 ***
Price.Range2                        30.9450    37.5436   0.824   0.4101    
Price.Range3                       -40.5716    95.8266  -0.423   0.6722    
Price.Range4                         8.5114    60.3056   0.141   0.8878    
Business...Stars                   103.5992    10.7838   9.607  < 2e-16 ***
Business...Wi.Fifree                30.8771    33.5198   0.921   0.3573    
Business...Wi.Fino                  29.1342    29.2007   0.998   0.3188    
Business...Wi.Fipaid                51.4933    69.7953   0.738   0.4609    
Business...Noise.Levelaverage       16.8431    40.5900   0.415   0.6783    
Business...Noise.Levelloud           0.9395    44.8121   0.021   0.9833    
Business...Noise.Levelquiet        -63.6865    42.3624  -1.503   0.1333    
Business...Noise.Levelvery_loud    -11.0146    50.0898  -0.220   0.8260    
Business...Good.For.KidsTRUE       -61.5745    13.2087  -4.662 3.86e-06 ***
Business...Has.TVTRUE              -22.7518    10.9719  -2.074   0.0385 *  
Price.Range2:Business...Wi.Fifree   77.5759    44.1901   1.756   0.0797 .  
Price.Range3:Business...Wi.Fifree  115.5734   105.2879   1.098   0.2728    
Price.Range4:Business...Wi.Fifree   76.5102    96.5360   0.793   0.4283    
Price.Range2:Business...Wi.Fino     45.2604    39.6692   1.141   0.2543    
Price.Range3:Business...Wi.Fino    105.7416    99.3487   1.064   0.2876    
Price.Range4:Business...Wi.Fino          NA         NA      NA       NA    
Price.Range2:Business...Wi.Fipaid        NA         NA      NA       NA    
Price.Range3:Business...Wi.Fipaid        NA         NA      NA       NA    
Price.Range4:Business...Wi.Fipaid        NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 128.4 on 609 degrees of freedom
  (5145 observations deleted due to missingness)
Multiple R-squared:  0.2946,    Adjusted R-squared:  0.2737 
F-statistic: 14.13 on 18 and 609 DF,  p-value: < 2.2e-16
plot_model(business_reg)

business_reg <- lm(Business...Review.Count~Price.Range + Business...Stars + Business...Wi.Fi+Business...Noise.Level + Price.Range*Business...Wi.Fi,data=business_data)
summary(business_reg)

Call:
lm(formula = Business...Review.Count ~ Price.Range + Business...Stars + 
    Business...Wi.Fi + Business...Noise.Level + Price.Range * 
    Business...Wi.Fi, data = business_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-137.83  -35.90  -10.68   16.10 1178.79 

Coefficients: (1 not defined because of singularities)
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        -77.108      5.804 -13.285  < 2e-16 ***
Price.Range2                         4.200      4.368   0.962 0.336320    
Price.Range3                        15.511     12.470   1.244 0.213580    
Price.Range4                        12.457     34.198   0.364 0.715675    
Business...Stars                    24.596      1.554  15.827  < 2e-16 ***
Business...Wi.Fifree                16.211      4.178   3.880 0.000106 ***
Business...Wi.Fino                  13.991      3.491   4.008 6.21e-05 ***
Business...Wi.Fipaid                 6.448     20.567   0.313 0.753918    
Business...Noise.Levelaverage       26.721      3.752   7.122 1.20e-12 ***
Business...Noise.Levelloud          19.800      5.251   3.771 0.000165 ***
Business...Noise.Levelquiet         -5.598      4.136  -1.353 0.175974    
Business...Noise.Levelvery_loud      2.014      7.043   0.286 0.774955    
Price.Range2:Business...Wi.Fifree   49.822      6.030   8.262  < 2e-16 ***
Price.Range3:Business...Wi.Fifree   74.585     18.772   3.973 7.18e-05 ***
Price.Range4:Business...Wi.Fifree   82.069     41.958   1.956 0.050517 .  
Price.Range2:Business...Wi.Fino     44.020      5.334   8.252  < 2e-16 ***
Price.Range3:Business...Wi.Fino     70.727     15.284   4.627 3.79e-06 ***
Price.Range4:Business...Wi.Fino     83.851     39.520   2.122 0.033904 *  
Price.Range2:Business...Wi.Fipaid   26.573     28.255   0.940 0.347014    
Price.Range3:Business...Wi.Fipaid    9.150     41.650   0.220 0.826119    
Price.Range4:Business...Wi.Fipaid       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 76.24 on 5441 degrees of freedom
  (312 observations deleted due to missingness)
Multiple R-squared:  0.2276,    Adjusted R-squared:  0.2249 
F-statistic: 84.36 on 19 and 5441 DF,  p-value: < 2.2e-16
plot_model(business_reg)

0.3 Looking at Rating

business_reg <- lm(Business...Stars~Price.Range  + Business...Wi.Fi+Business...Noise.Level,data=business_data)
summary(business_reg)

Call:
lm(formula = Business...Stars ~ Price.Range + Business...Wi.Fi + 
    Business...Noise.Level, data = business_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.58667 -0.44423 -0.07531  0.41333  1.92030 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.079701   0.027082 113.718  < 2e-16 ***
Price.Range2                     0.011354   0.018836   0.603   0.5467    
Price.Range3                     0.123862   0.055099   2.248   0.0246 *  
Price.Range4                     0.158233   0.122134   1.296   0.1952    
Business...Wi.Fifree             0.188803   0.027175   6.948 4.14e-12 ***
Business...Wi.Fino               0.174516   0.024168   7.221 5.87e-13 ***
Business...Wi.Fipaid             0.002281   0.114320   0.020   0.9841    
Business...Noise.Levelaverage    0.321096   0.032331   9.932  < 2e-16 ***
Business...Noise.Levelloud       0.053050   0.045684   1.161   0.2456    
Business...Noise.Levelquiet      0.364531   0.035525  10.261  < 2e-16 ***
Business...Noise.Levelvery_loud -0.148999   0.061313  -2.430   0.0151 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6649 on 5450 degrees of freedom
  (312 observations deleted due to missingness)
Multiple R-squared:  0.07131,   Adjusted R-squared:  0.0696 
F-statistic: 41.85 on 10 and 5450 DF,  p-value: < 2.2e-16
plot_model(business_reg)

0.4 Momentum of Ratings

We want to look at the effect a business’s current rating has on its ability to attract new customers and its ability to improve its overall rating. There are definitely better ways to do this, but for now, at each point in time for which we have data, we calculate the average rating at that point, the number of reviews the business receives after that point, and its average rating after that point.
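One way the calculation described above might be sketched in base R is shown here. This is an illustration under assumptions, not the notebook's actual implementation: the `reviews` data frame, its column names (`business_id`, `date`, `stars`), and its values are all hypothetical, and each review is treated as one "point in time".

```r
# Hypothetical review log; column names and values are assumptions for illustration.
reviews <- data.frame(
  business_id = c("a", "a", "a", "a", "b", "b"),
  date = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01",
                   "2015-01-15", "2015-02-15")),
  stars = c(5, 3, 4, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Order reviews chronologically within each business.
reviews <- reviews[order(reviews$business_id, reviews$date), ]

momentum <- do.call(rbind, lapply(split(reviews, reviews$business_id), function(d) {
  n <- nrow(d)
  k <- seq_len(n)             # number of reviews up to and including this one
  running <- cumsum(d$stars)  # running total of stars
  d$avg_so_far <- running / k # average rating at this point
  d$n_after <- n - k          # number of reviews received after this point
  # Average rating of the later reviews; NA when there are none.
  d$avg_after <- ifelse(d$n_after > 0,
                        (sum(d$stars) - running) / d$n_after,
                        NA_real_)
  d
}))
rownames(momentum) <- NULL

print(momentum)
```

Each row then pairs the rating a business had at a given moment with what happened afterwards, so `n_after` or `avg_after` could be regressed on `avg_so_far`. Note that the last review of every business has no "after" period and is left as `NA`.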
